[SPARK-12936][SQL] Initial bloom filter implementation #10883

cloud-fan · 2016-01-23T22:17:08Z

This PR adds an initial bloom filter implementation to the newly added sketch module. The implementation is based on the BloomFilter class in Guava.

Some differences from the design doc:

- Expose bitSize instead of sizeInBytes to the user.
- Always require the expectedInsertions parameter when creating a bloom filter.
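
For context on why the expected number of insertions is needed up front, here is a minimal sketch of the standard bloom filter sizing formulas. The class and method names (BloomSizing, optimalNumOfBits, optimalNumOfHashFunctions) are assumptions made for illustration and are not taken from this patch.

```java
// Hypothetical helper, not the code from this PR: the textbook bloom filter sizing formulas.
final class BloomSizing {
  private BloomSizing() {}

  /** Number of bits m = -n * ln(p) / (ln 2)^2 for n expected items and false positive rate p. */
  static long optimalNumOfBits(long expectedNumItems, double fpp) {
    return (long) (-expectedNumItems * Math.log(fpp) / (Math.log(2) * Math.log(2)));
  }

  /** Number of hash functions k = (m / n) * ln 2, rounded, and never less than 1. */
  static int optimalNumOfHashFunctions(long expectedNumItems, long numBits) {
    return Math.max(1, (int) Math.round((double) numBits / expectedNumItems * Math.log(2)));
  }

  public static void main(String[] args) {
    long n = 1000L;
    long m = optimalNumOfBits(n, 0.03);
    int k = optimalNumOfHashFunctions(n, m);
    System.out.println("bits=" + m + ", hashFunctions=" + k);
  }
}
```

Both quantities depend on the expected item count, so a filter created without it could not pick a bit size for a target false positive rate.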

cloud-fan · 2016-01-23T22:18:18Z

cc @rxin @liancheng

SparkQA · 2016-01-23T22:35:30Z

Test build #49939 has finished for PR 10883 at commit bbf3822.

This patch fails RAT tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- public abstract class BloomFilter
- public class DefaultBloomFilter extends BloomFilter
- static final class BitArray

SparkQA · 2016-01-23T23:00:48Z

Test build #49940 has finished for PR 10883 at commit 2db0171.

This patch fails Scala style tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- public abstract class BloomFilter
- public class DefaultBloomFilter extends BloomFilter
- static final class BitArray

SparkQA · 2016-01-23T23:26:16Z

Test build #49941 has finished for PR 10883 at commit 1617226.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- public abstract class BloomFilter
- public class DefaultBloomFilter extends BloomFilter
- static final class BitArray

rxin · 2016-01-24T04:33:14Z

common/sketch/src/main/java/org/apache/spark/util/sketch/DefaultBloomFilter.java

+import java.io.UnsupportedEncodingException;
+import java.util.Arrays;
+
+public class DefaultBloomFilter extends BloomFilter {


We should be consistent in naming with the count-min sketch one, i.e. rename this to BloomFilterImpl.

cloud-fan · 2016-01-25T18:55:33Z

common/sketch/src/main/java/org/apache/spark/util/sketch/BloomFilterImpl.java

+    return bits.bitSize();
+  }
+
+  private static long hashObjectToLong(Object item) {


The string part is the same as in count-min, but the long part is different. Is it worth abstracting it? cc @rxin

Why is the long part different?

Because they use different strategies to produce n hash values for a long.

Let's add an inline comment explaining why they are different.
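
To illustrate one strategy for producing n hash values from a single long, here is a rough sketch of double hashing, where the i-th value is derived as h1 + i * h2. This is an assumption about the general technique, not a copy of BloomFilterImpl; the class DoubleHashing and the mixer mix64 (the SplitMix64 finalizer) are hypothetical stand-ins for the real hash function.

```java
// Hypothetical sketch of double hashing; not the code from this PR.
final class DoubleHashing {
  private DoubleHashing() {}

  // Stand-in 64-bit mixer (the SplitMix64 finalizer), used here instead of the real hash.
  static long mix64(long z) {
    z = (z ^ (z >>> 30)) * 0xBF58476D1CE4E5B9L;
    z = (z ^ (z >>> 27)) * 0x94D049BB133111EBL;
    return z ^ (z >>> 31);
  }

  /** Derives numHashFunctions bit positions in [0, bitSize) from two base hashes of item. */
  static long[] bitIndexes(long item, int numHashFunctions, long bitSize) {
    long mixed = mix64(item);
    int h1 = (int) mixed;          // low 32 bits
    int h2 = (int) (mixed >>> 32); // high 32 bits
    long[] indexes = new long[numHashFunctions];
    for (int i = 1; i <= numHashFunctions; i++) {
      int combined = h1 + i * h2;  // int overflow wraps around, which is acceptable for hashing
      if (combined < 0) {
        combined = ~combined;      // flip all bits so the value is non-negative
      }
      indexes[i - 1] = combined % bitSize;
    }
    return indexes;
  }

  public static void main(String[] args) {
    System.out.println(java.util.Arrays.toString(bitIndexes(42L, 5, 1024L)));
  }
}
```

A count-min sketch can use a different family of hash functions per row, which is one reason the long path may not be shareable as-is.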

SparkQA · 2016-01-25T19:10:56Z

Test build #50002 has finished for PR 10883 at commit 920f292.

This patch fails Scala style tests.
This patch merges cleanly.
This patch adds no public classes.

liancheng · 2016-01-25T19:56:47Z

common/sketch/src/test/scala/org/apache/spark/util/sketch/BloomFilterSuite.scala

+      while (i < numInsertion) {
+        filter.put(allItems(i))
+        i += 1
+      }


I think it's OK to use allItems.take(numInsertion).foreach(filter.put) for simplicity if this code path doesn't slow down test execution too much.

SparkQA · 2016-01-25T20:23:19Z

Test build #50005 has finished for PR 10883 at commit 3633952.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-01-25T21:34:05Z

Test build #50012 has finished for PR 10883 at commit 4fce26e.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

liancheng · 2016-01-25T21:44:59Z

common/sketch/src/test/scala/org/apache/spark/util/sketch/BloomFilterSuite.scala

+      // After merge, `filter1` has `numItems` items which doesn't exceed `expectedNumItems`, so the
+      // `expectedFpp` should not be significantly higher than the default one: 3%
+      // Skip byte type as it has too little distinct values.
+      assert(typeName == "Byte" || 0.03 - filter1.expectedFpp() < 0.001)


Does the 0.001 here stand for epsilon? Should we wrap 0.03 - filter1.expectedFpp() with Math.abs()?

Do we still need to special case "Byte" here when we already have numItems (it's 200 for Byte type)?

Would be nice to make 0.001 a private static constant, e.g. EPSILON.

Actually we should swap 0.03 and filter1.expectedFpp() here.
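
A small sketch of why the operand order matters in that assertion, assuming a named EPSILON as suggested above. The class FppCheck and its method names are hypothetical; only the values 0.03 and 0.001 come from the quoted test.

```java
// Hypothetical illustration of the tolerance check discussed above; not the test from this PR.
final class FppCheck {
  private FppCheck() {}

  static final double EPSILON = 0.001;

  /** Operand order as quoted in the diff: true for any observed rate above the target. */
  static boolean originalOrder(double observedFpp, double targetFpp) {
    return targetFpp - observedFpp < EPSILON;
  }

  /** Swapped order: true only when the observed rate exceeds the target by less than EPSILON. */
  static boolean swappedOrder(double observedFpp, double targetFpp) {
    return observedFpp - targetFpp < EPSILON;
  }

  public static void main(String[] args) {
    // A clearly bad observed rate of 50% against a 3% target.
    System.out.println("original order accepts 0.5: " + originalOrder(0.5, 0.03));
    System.out.println("swapped order accepts 0.5: " + swappedOrder(0.5, 0.03));
  }
}
```

With the original order the difference is negative whenever the observed rate is too high, so the assertion cannot fail in the case it is meant to catch.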

liancheng · 2016-01-25T21:48:26Z

One high-level comment about the testing code: I'm a little bit confused by the magic numbers there. It would be nice to name them.

liancheng · 2016-01-25T21:50:36Z

common/sketch/src/main/java/org/apache/spark/util/sketch/BloomFilter.java

+   * Creates a {@link BloomFilter} with given {@code expectedNumItems} and a default 3% {@code fpp}.
+   */
+  public static BloomFilter create(long expectedNumItems) {
+    return create(expectedNumItems, 0.03);


Maybe we should have a public static constant DEFAULT_EXPECTED_FPP for this 0.03. I was pretty confused when finding the magic number 0.03 in the testing code.
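
One possible shape for that suggestion, as a sketch only. FilterFactorySketch is a hypothetical stand-in for the abstract BloomFilter class and models nothing beyond the two factory overloads; the constant name DEFAULT_EXPECTED_FPP is the one proposed above.

```java
// Hypothetical stand-in for the factory methods; only the named default is the point here.
final class FilterFactorySketch {
  /** Named default replacing the literal 0.03 in both the factory and the tests. */
  static final double DEFAULT_EXPECTED_FPP = 0.03;

  final long expectedNumItems;
  final double fpp;

  private FilterFactorySketch(long expectedNumItems, double fpp) {
    this.expectedNumItems = expectedNumItems;
    this.fpp = fpp;
  }

  /** Uses the default false positive probability. */
  static FilterFactorySketch create(long expectedNumItems) {
    return create(expectedNumItems, DEFAULT_EXPECTED_FPP);
  }

  static FilterFactorySketch create(long expectedNumItems, double fpp) {
    if (expectedNumItems <= 0) {
      throw new IllegalArgumentException("expectedNumItems must be positive");
    }
    if (fpp <= 0.0 || fpp >= 1.0) {
      throw new IllegalArgumentException("fpp must be in (0, 1)");
    }
    return new FilterFactorySketch(expectedNumItems, fpp);
  }

  public static void main(String[] args) {
    System.out.println(create(1000L).fpp);
  }
}
```

Test code can then compare against DEFAULT_EXPECTED_FPP instead of repeating the literal.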

liancheng · 2016-01-25T22:25:27Z

Tip: I renamed CountMinSketchMergeException to IncompatibleMergeException and made it checked in #10893. You can use it in mergeInPlace.
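
A rough sketch of how a checked merge exception could be used in mergeInPlace. The exception class below is a local stand-in that only shares its name with the one mentioned in the tip, and TinyBitFilter is a hypothetical simplification built on java.util.BitSet rather than the BitArray from this patch.

```java
import java.util.BitSet;

// Local stand-in for the checked exception named in the tip above; not Spark's class.
class IncompatibleMergeException extends Exception {
  IncompatibleMergeException(String message) {
    super(message);
  }
}

// Hypothetical, heavily simplified filter that only models the merge path.
final class TinyBitFilter {
  private final int bitSize;
  private final BitSet bits;

  TinyBitFilter(int bitSize) {
    this.bitSize = bitSize;
    this.bits = new BitSet(bitSize);
  }

  void setBit(int index) {
    bits.set(index);
  }

  boolean getBit(int index) {
    return bits.get(index);
  }

  /** ORs the other filter's bits into this one; callers must handle incompatible inputs. */
  TinyBitFilter mergeInPlace(TinyBitFilter other) throws IncompatibleMergeException {
    if (other == null) {
      throw new IncompatibleMergeException("Cannot merge null bloom filter");
    }
    if (other.bitSize != bitSize) {
      throw new IncompatibleMergeException("Cannot merge bloom filters with different bit sizes");
    }
    bits.or(other.bits);
    return this;
  }

  public static void main(String[] args) {
    try {
      new TinyBitFilter(64).mergeInPlace(new TinyBitFilter(128));
    } catch (IncompatibleMergeException e) {
      System.out.println("merge rejected: " + e.getMessage());
    }
  }
}
```

Because the exception is checked, every caller of mergeInPlace has to either catch it or declare it, which makes the incompatible case visible in tests.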

liancheng · 2016-01-25T22:26:21Z

common/sketch/src/test/scala/org/apache/spark/util/sketch/BloomFilterSuite.scala

+  testItemType[Long]("Long", 100000) { _.nextLong() }
+
+  testItemType[String]("String", 100000) { r => r.nextString(r.nextInt(512)) }
+}


Would be nice to add another test case for incompatible merge (like this one).

SparkQA · 2016-01-25T23:29:34Z

Test build #50031 has finished for PR 10883 at commit b850bfd.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-01-25T23:51:11Z

Test build #50034 has finished for PR 10883 at commit a9a6e83.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- \"Cannot merge bloom filter of class \" + other.getClass().getName()

liancheng · 2016-01-26T01:46:21Z

LGTM

rxin · 2016-01-26T01:57:54Z

As discussed offline, it might be better to have put(String) and put(Long) in addition to put(Object).

I'm going to merge this pull request. Please address the remaining comments in a new pull request.
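
For the follow-up, the typed overloads suggested above could look roughly like the sketch below. This is an assumption about the API shape, not the eventual implementation: TypedPutSketch is hypothetical, it sets a single bit per item instead of several, and it uses the JDK's built-in hash codes rather than the real hash function.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.BitSet;

// Hypothetical sketch of typed put overloads next to put(Object); one bit per item for brevity.
final class TypedPutSketch {
  private static final int BIT_SIZE = 1024;
  private final BitSet bits = new BitSet(BIT_SIZE);

  /** Typed path for strings: hashes the UTF-8 bytes directly. */
  boolean put(String item) {
    return setBit(Arrays.hashCode(item.getBytes(StandardCharsets.UTF_8)));
  }

  /** Typed path for integral values: avoids boxing at the call site. */
  boolean put(long item) {
    return setBit(Long.hashCode(item));
  }

  /** Generic path: dispatches on the runtime type and rejects unsupported ones. */
  boolean put(Object item) {
    if (item instanceof String) {
      return put((String) item);
    }
    if (item instanceof Long || item instanceof Integer
        || item instanceof Short || item instanceof Byte) {
      return put(((Number) item).longValue());
    }
    throw new IllegalArgumentException("Unsupported item type: " + item.getClass().getName());
  }

  /** Sets the bit for the given hash and reports whether it was previously unset. */
  private boolean setBit(int hash) {
    int index = Math.floorMod(hash, BIT_SIZE);
    boolean changed = !bits.get(index);
    bits.set(index);
    return changed;
  }

  public static void main(String[] args) {
    TypedPutSketch filter = new TypedPutSketch();
    System.out.println(filter.put("spark"));
    System.out.println(filter.put(5L));
  }
}
```

The typed overloads let callers skip the runtime type check, while put(Object) remains available for generic code.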

cloud-fan force-pushed the bloom-filter branch from 58e7c59 to cbc53f2 Compare January 23, 2016 22:18

cloud-fan force-pushed the bloom-filter branch from cbc53f2 to bbf3822 Compare January 23, 2016 22:19

cloud-fan force-pushed the bloom-filter branch from bbf3822 to 2db0171 Compare January 23, 2016 22:47

cloud-fan force-pushed the bloom-filter branch from 2db0171 to 1617226 Compare January 23, 2016 23:04

Initial bloom filter implementation

1617226

rxin reviewed Jan 24, 2016

cloud-fan force-pushed the bloom-filter branch from 69ac65d to 920f292 Compare January 25, 2016 18:53

address comments

920f292

cloud-fan reviewed Jan 25, 2016

fix style

3633952

liancheng reviewed Jan 25, 2016

add comment

4fce26e

liancheng reviewed Jan 25, 2016

cloud-fan added 3 commits January 25, 2016 15:09

address comments

b850bfd

Merge remote-tracking branch 'origin/master' into bloom-filter

2affab6

add error merge test

a9a6e83

asfgit closed this in 109061f Jan 26, 2016

cloud-fan deleted the bloom-filter branch January 26, 2016 01:59
